Replace nested static_for lambdas with compile-time search helper #3600
Draft
tenpercent wants to merge 6 commits into develop from mpodkory/find-transform-optimization
+276 −72
Conversation
This was referenced Jan 16, 2026
The GetTransformAndItsUpperDimension function used nested static_for loops with lambdas to search for a hidden dimension in UpperDimensionIdss. This caused 918 applier::operator() instantiations (81% of all applier instantiations). Replace with a find_in_tuple_of_sequences helper that uses a constexpr array lookup and if-constexpr recursion, eliminating the lambda instantiation overhead.

Results on example_grouped_conv_fwd_xdl_fp16:
- applier instantiations: 1132 -> 127 (89% reduction)
- TensorDescriptor instantiations: 2503 -> 664 (73% reduction)
- Template instantiation time: 23.4s -> 19.4s (17% reduction)
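The constexpr-array-plus-if-constexpr shape described above can be sketched in isolation. The `Sequence` stand-in, the `FindResult` struct, and the exact signatures below are simplified assumptions for illustration, not the CK implementation:

```cpp
#include <array>
#include <cstddef>

// Minimal stand-in for the library's Sequence type (simplified assumption).
template <std::size_t... Is>
struct Sequence
{
    static constexpr std::size_t size() { return sizeof...(Is); }
};

// Constexpr array scan: a plain loop, so template depth stays O(1)
// regardless of sequence length, and no lambda closures are created.
template <std::size_t Value, std::size_t... Is>
constexpr std::size_t sequence_find_value(Sequence<Is...>)
{
    constexpr std::array<std::size_t, sizeof...(Is)> arr{Is...};
    for(std::size_t i = 0; i < arr.size(); ++i)
    {
        if(arr[i] == Value)
            return i;
    }
    return sizeof...(Is); // sentinel: not found
}

// Hypothetical result type: which sequence held the value, and where.
struct FindResult
{
    std::size_t seq_idx;
    std::size_t pos;
};

// if-constexpr recursion over the pack: one instantiation per sequence,
// instead of one applier::operator() per (sequence, element) pair.
template <std::size_t Value, typename First, typename... Rest>
constexpr FindResult find_in_tuple_of_sequences(First, Rest...)
{
    constexpr std::size_t pos = sequence_find_value<Value>(First{});
    if constexpr(pos < First::size())
    {
        return {0, pos};
    }
    else if constexpr(sizeof...(Rest) > 0)
    {
        constexpr FindResult r = find_in_tuple_of_sequences<Value>(Rest{}...);
        return {r.seq_idx + 1, r.pos};
    }
    else
    {
        // not found anywhere
        return {static_cast<std::size_t>(-1), static_cast<std::size_t>(-1)};
    }
}

// Example: value 5 lives in the second sequence, at position 0.
static_assert(find_in_tuple_of_sequences<5>(Sequence<0, 1>{}, Sequence<5, 3>{}, Sequence<7>{}).seq_idx == 1);
static_assert(find_in_tuple_of_sequences<5>(Sequence<0, 1>{}, Sequence<5, 3>{}, Sequence<7>{}).pos == 0);
```

Because the search is an ordinary constexpr loop, the compiler never materializes a closure type per call site, which is where the applier instantiation count was coming from.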
…tSize

The InitializeElementSize function used generate_tuple with a lambda to compute visible dimension lengths. Each TensorDescriptor type created a unique lambda type, causing 78 instantiations (385ms). Replace with direct pack expansion using helper functions, eliminating the lambda instantiation overhead entirely.

Results on example_grouped_conv_fwd_xdl_fp16:
- generate_tuple lambdas: 178 -> 100 (44% reduction)
- Template instantiation time: 19.5s -> 19.0s
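The generate_tuple-to-pack-expansion swap can be illustrated with a toy descriptor. `ToyDescriptor`, `get_visible_length`, and `make_element_lengths` are hypothetical names standing in for the real members; this is a sketch of the pattern, not the actual code:

```cpp
#include <cstddef>
#include <tuple>
#include <utility>

// Toy stand-in for TensorDescriptor; the real class stores lengths
// differently (this struct is an assumption for illustration only).
struct ToyDescriptor
{
    std::tuple<int, int, int> lengths{2, 3, 4};
};

// A named free function replaces the lambda body. Unlike a lambda written
// inside a class template, it does not stamp out a fresh closure type for
// every enclosing descriptor instantiation.
template <std::size_t I, typename Desc>
constexpr auto get_visible_length(const Desc& desc)
{
    return std::get<I>(desc.lengths);
}

// Direct pack expansion over an index_sequence replaces generate_tuple:
// the expansion happens in one shared helper.
template <typename Desc, std::size_t... Is>
constexpr auto make_element_lengths(const Desc& desc, std::index_sequence<Is...>)
{
    return std::make_tuple(get_visible_length<Is>(desc)...);
}

// Usage: expands to make_tuple(get<0>(...), get<1>(...), get<2>(...)).
// auto lens = make_element_lengths(ToyDescriptor{}, std::make_index_sequence<3>{});
```

The before/after difference is purely in who owns the per-element code: a unique lambda per class instantiation before, a single family of named function templates after.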
TensorAdaptor has InitializeElementSize and GetTransformAndItsUpperDimension patterns identical to those in TensorDescriptor. Apply the same optimization:
- Replace nested static_for lambdas with find_in_tuple_of_sequences
- Replace the generate_tuple lambda with pack expansion

Results: generate_tuple lambdas 100 -> 96 (4 events, 17ms eliminated)
Detailed comments explain:
- sequence_find_value: constexpr loop with O(1) template depth vs O(N) recursion
- find_in_tuple_of_sequences: pack expansion instead of nested static_for loops
- Why constexpr search dramatically reduces template instantiations
- When to apply constexpr search patterns for compile-time operations
- Implementation details for each optimization approach

This documentation helps maintainers understand the compile-time search optimization strategy without relying on specific benchmark numbers that may vary by use case.
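The O(1)-vs-O(N) template-depth contrast from the first comment can be demonstrated side by side. Both forms below compute the same answer; the names and shapes are illustrative assumptions, not the documented CK code:

```cpp
#include <array>
#include <cstddef>

// O(N) template depth: a classic recursive metafunction. Searching a
// sequence of N elements instantiates up to N + 1 distinct class templates.
template <std::size_t Value, std::size_t... Is>
struct FindRecursive;

template <std::size_t Value>
struct FindRecursive<Value>
{
    static constexpr std::size_t pos = 0; // base case: not found
};

template <std::size_t Value, std::size_t Head, std::size_t... Tail>
struct FindRecursive<Value, Head, Tail...>
{
    static constexpr std::size_t pos =
        (Head == Value) ? 0 : 1 + FindRecursive<Value, Tail...>::pos;
};

// O(1) template depth: the same search as a single constexpr function with
// a plain loop over a constexpr array (the sequence_find_value approach).
template <std::size_t Value, std::size_t... Is>
constexpr std::size_t find_loop()
{
    constexpr std::array<std::size_t, sizeof...(Is)> arr{Is...};
    for(std::size_t i = 0; i < arr.size(); ++i)
    {
        if(arr[i] == Value)
            return i;
    }
    return sizeof...(Is); // not found
}

// Both locate 7 at index 1; only the recursive form pays one class-template
// instantiation per element examined.
static_assert(FindRecursive<7, 4, 7, 9>::pos == find_loop<7, 4, 7, 9>());
static_assert(find_loop<7, 4, 7, 9>() == 1);
```

One instantiation versus a chain of them is exactly the difference the comments document: the loop body is ordinary evaluated code, invisible to the template instantiation machinery.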
Summary
- Add find_in_tuple_of_sequences, a compile-time search helper with O(1) template depth
- Replace nested static_for lambdas in TensorDescriptor::GetTransformAndItsUpperDimension
- Replace the generate_tuple lambda in TensorDescriptor::InitializeElementSize with pack expansion
- Apply the same optimizations to TensorAdaptor

Motivation
The TensorDescriptor and TensorAdaptor classes had excessive template instantiation from:
- nested static_for loops with lambdas (918 applier::operator() instantiations)
- generate_tuple with lambdas (78+ instantiations per class)

Why It Works
Each lambda creates a unique closure type, causing separate instantiations at every call site. The find_in_tuple_of_sequences helper uses O(1) template depth via pack expansion instead of O(N) nested static_for recursion, and named functors share a single type across all uses.

Results (example_grouped_conv_fwd_xdl_fp16)
- applier instantiations: 1132 -> 127
- generate_tuple lambdas: 178 -> 96

Test Plan
- sequence_find_value
- find_in_tuple_of_sequences

PR Stack
This PR is part of the build time optimization effort (issue #3575). All PRs now target develop independently:
- __make_integer_seq

Tracking issue: #3575